While everybody buys and consumes numerous food products every day, little attention is usually paid to their precise composition. For this study, we decided to focus on the Open Food Facts dataset, which gathers information on a very broad variety of products, including their nutrition facts and nutrition scores derived from different standards, such as the UK Food Standards Agency's (FSA) score and the equivalent score derived for the French market.
In particular, this study focuses on the quantitative comparison of organic and regular products. Is there a difference between the two? An effort has been made to split the products into different categories in order to compare only similar kinds of products and carry out a more rigorous analysis.
This notebook contains all the code which has been used to generate the plots and figures of our datastory.
We import required packages as well as the dataset itself.
import pickle
import pprint
import time
import sys
import os
import json
import copy
import folium
import itertools
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import re
import seaborn as sns
import statsmodels.formula.api as sm
import plotly
from plotly import graph_objs as go, tools
import numpy as np
import json
from sklearn.feature_extraction.text import CountVectorizer
from string import punctuation
from math import pi
# We generate the interactive plots with Altair
import altair as alt
# Make sure to download the dataset and place it in the data folder
OPEN_FOOD_FACTS_PATH = 'data/en.openfoodfacts.org.products.csv'
ENCODING = 'UTF-8'
chunksize = 100000
tfr = pd.read_csv(OPEN_FOOD_FACTS_PATH, encoding=ENCODING, sep='\t', chunksize=chunksize, iterator=True, low_memory=False)
food_facts_df = pd.concat(tfr, ignore_index=True)
food_facts_df.head(5)
print("There are {} rows, hence products, in the dataset.".format(food_facts_df.shape[0]))
print("There are {} columns, hence fields, in the dataset.".format(food_facts_df.shape[1]))
The dataset description is available here.
This dataset is provided with a text file describing the different fields. We present a brief overview of the main types of fields:
- code: the product's code.
- creator: who added the product to the dataset.
- countries: where the product is sold.
- fields ending in _t: dates in the UNIX timestamp format (number of seconds since Jan 1st, 1970).
- fields ending in _datetime: dates in the ISO 8601 format: yyyy-mm-ddThh:mn:ssZ.
- fields ending in _tags: comma-separated lists of tags (e.g. categories_tags is the set of normalized tags computed from the categories field).
- fields ending in _100g: the amount of a nutriment (in g, or kJ for energy) per 100 g or 100 ml of product.
- fields ending in _serving: the amount of a nutriment (in g, or kJ for energy) per serving of the product.
A lot of products claim to be bio. Different terms are used depending on the location (organic, bio, ...), but overall they all refer to the same fact: the product was mostly produced in compliance with the standards of organic farming. People often claim that organic (bio) products are healthier than non-organic ones.
In the upcoming analysis, we will investigate whether this claim is quantitatively true, or whether companies sometimes take advantage of the "bio" label to gain market share.
We will conduct this analysis by investigating different columns of interest:
Some products contain words such as 'bio', 'biologic' or 'organic' in their product_name, so this can be used to distinguish them from regular products. In addition, the dataset contains the columns labels, labels_tags and labels_en, which hold information about quality labels/certifications, including organic ones.
Thus, these columns can be used to split the dataset into organic and regular products. We will search for keywords in these columns to determine whether a product is organic:
bio_keywords = ['bio', 'organi'] # bio --> bio, biological, biologique, etc.; organi --> organic, organique
contains_bio_keywords = lambda x: any([(kw in str(x)) for kw in bio_keywords])
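As a quick sanity check, the detector can be tried on a few illustrative strings (made-up examples, not taken from the dataset):

```python
# Same keyword list and detector as above, repeated so this cell is self-contained
bio_keywords = ['bio', 'organi']  # bio --> bio, biological, ...; organi --> organic, organique
contains_bio_keywords = lambda x: any(kw in str(x) for kw in bio_keywords)

print(contains_bio_keywords('organic almond milk'))  # True
print(contains_bio_keywords('Yaourt bio'))           # True
print(contains_bio_keywords('Organic almond milk'))  # False: matching is case-sensitive
print(contains_bio_keywords(float('nan')))           # False: NaN becomes the string 'nan'
```

Note that the check is case-sensitive, so a product named 'Organic ...' with a capital O would not match; this limitation is worth keeping in mind.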
Let's add a new boolean column to the dataframe, to specify whether the product is organic:
# Check for products matching the bio keywords in the 4 columns
bio_products = (food_facts_df['product_name'].apply(contains_bio_keywords)) \
| (food_facts_df['labels'].apply(contains_bio_keywords)) \
| (food_facts_df['labels_tags'].apply(contains_bio_keywords)) \
| (food_facts_df['labels_en'].apply(contains_bio_keywords))
food_facts_df['bio'] = bio_products
print('There are {} organic products, and {} regular products.'.format(food_facts_df['bio'].sum(),
len(food_facts_df['bio']) - food_facts_df['bio'].sum()))
food_facts_df[food_facts_df['bio'] == True].head(5)
We now have a new column in our dataset indicating whether products are organic or regular.
Since product categories are represented differently between regular and organic products, balancing the datasets as much as possible is essential to provide insightful information on both groups. This was the main shortcoming of milestone 2 and it is addressed here.
For our analysis, we want to compare organic and regular products in terms of nutritional score, contained additives, NOVA groups and more. However, as shown in the previous section, we have two highly unbalanced classes, as there are far fewer organic products. In addition, we don't know whether the distribution of product categories differs between the regular and organic data. For instance, organic products might include many more juices than other types of products.
In order to provide a more rigorous analysis, we will split the dataset into different categories, and then perform a comparative analysis within each category.
The Open Food Facts dataset contains several columns representing categories. The most consistently filled one is categories_en. It holds a set of categories separated by commas, along with language prefixes such as en: and fr: for English and French respectively.
Our goal is to find out which categories are actually present in categories_en. Then we can find common categories to split the dataset based on keywords. For instance, if we wanted to find all the dairy products, we could look for the keywords [milk, cheese, yoghurt] in categories_en.
We could also look for category-related keywords in the product_name column. However, it would be harder to obtain a reliable split. Indeed, a product named goat milk could be classified in the meat category since its name contains goat, whereas its categories_en column would not contain meat. For this reason, we use only the categories_en column.
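The pitfall described above can be illustrated with a toy example (the strings below are invented for illustration, not actual dataset rows):

```python
meat_keywords = ['goat', 'beef', 'pork']

product_name = 'goat milk'
categories_en = 'en:dairies,en:milks'  # roughly what a real entry could look like

# Matching on the name wrongly flags the product as meat...
name_says_meat = any(kw in product_name for kw in meat_keywords)
# ...while matching on categories_en does not.
categories_say_meat = any(kw in categories_en for kw in meat_keywords)

print(name_says_meat)       # True
print(categories_say_meat)  # False
```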
# Get the relevant column from the dataset
categories_df = food_facts_df[['categories_en']].dropna().copy()
categories_df.head(5)
Our goal is to identify the most common words describing the categories of products across the dataset.
We use the CountVectorizer class to build a bag of words from categories_en.
We leverage the arguments of the class to lowercase all words, ignore stopwords and perform character normalization (removing accents, etc.). In addition, we discard words with fewer than 50 occurrences.
def strip_accents(words):
""" Remove accents in a list of words. """
# If words is [], None or similar
if not words:
return words
# We use the CountVectorizer to actually remove accents
vectorizer = CountVectorizer(lowercase=False, strip_accents='ascii')
vectorizer.fit_transform(words)
return vectorizer.get_feature_names()
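For reference, the same accent removal can be done with the standard library alone. This is a sketch using unicodedata (the name strip_accents_stdlib is ours); unlike the CountVectorizer version, it preserves word order and does not lowercase or tokenize:

```python
import unicodedata

def strip_accents_stdlib(words):
    """Remove accents from a list of words via Unicode decomposition:
    NFKD splits accented letters into base letter + combining mark,
    and the ASCII encoding step drops the combining marks."""
    if not words:
        return words
    return [
        unicodedata.normalize('NFKD', w).encode('ascii', 'ignore').decode('ascii')
        for w in words
    ]

print(strip_accents_stdlib(['légume', 'pâte', 'crème']))  # ['legume', 'pate', 'creme']
```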
# We use the stopwords from nltk (uncomment following 2 lines to download the keywords)
#import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
# We need french+english stopwords as there are both languages in the column
stop_words = stopwords.words('english')
stop_words.extend(strip_accents(stopwords.words('french')))
# These labels are present in the column to signify French/English categories
stop_words.append('fr')
stop_words.append('en')
# Remove duplicates if any
stop_words = list(set(stop_words))
# Discard categories that are too rare
min_occurences = 50
vectorizer = CountVectorizer(stop_words=stop_words, lowercase=True,
strip_accents='ascii', min_df=min_occurences)
bag = vectorizer.fit_transform(categories_df['categories_en'].values)
bag_features = vectorizer.get_feature_names()
print('There are {} different words in total.'.format(len(bag_features)))
print('Keywords:', bag_features[:10])
Let's find out the most common words:
# Sum along the columns to get the total number of occurrences of each token
words_occurences = np.array(bag.sum(axis=0))[0]
# Build a dataframe with the word occurrences
words_occurencies_df = pd.DataFrame({'word':bag_features,'count':words_occurences}).sort_values(by=['count'],ascending=False)\
.reset_index().drop(columns='index')
# Plot the most common words
n = 50
plt.figure(figsize=(10,20))
plt.barh(words_occurencies_df.loc[0:n]['word'].values[::-1],
words_occurencies_df.loc[0:n]['count'].values[::-1])
plt.title('Word occurrences in the "categories_en" tag (first {})'.format(n))
plt.ylabel('Word')
plt.xlabel('Nb. occurrences')
plt.show()
Many of these words can be used as categories, although some are too general, for instance the most common ones, foods and based.
Let's clean the column categories_en using the same pre-processing and stop-words removal as before:
preprocessor = vectorizer.build_preprocessor()
tokenizer = vectorizer.build_tokenizer()
categories_df['categories_en'] = categories_df['categories_en'].apply(preprocessor).apply(tokenizer)\
.apply(lambda words: ' '.join([w for w in words if w not in stop_words]))
categories_df.head(5)
We can now define a function to obtain the products that belong to a specific category defined by keywords:
def find_products_from_category(raw_df, pre_processed_df, category_keywords):
""" Obtain the products that fall in a given category.
A product belongs to a category if its `categories_en` column contains
one of the category_keywords.
Args:
raw_df: raw Open Food Facts dataframe, used to fetch all the product information
pre_processed_df: dataframe where the column `categories_en` has been
cleaned of stopwords, case, accents, etc.
category_keywords: list of keywords to find in `categories_en`. Since
pre_processed_df holds cleaned strings, there is no need to include
variations such as `cheese` and `cheeses`; a common stem is enough.
Return:
The products that belong to the category, with all the info from raw_df
"""
products = pre_processed_df[pre_processed_df['categories_en'].apply(
lambda x : any(kw in x for kw in category_keywords))]
return raw_df.loc[products.index]
For instance let's obtain the dairy products:
dairy_kw = ['dairies','milk','cheese']
dairy_df = find_products_from_category(food_facts_df, categories_df, category_keywords=dairy_kw)
print('There are {} dairy products.'.format(dairy_df.shape[0]))
dairy_df.head(10)
There are several possible choices of categories into which we can split the products. For instance, the products could be selected and grouped based on their type, following the food pyramid:

However, this split would yield highly unbalanced categories, and the categories would not be granular enough.
Instead, we decided to choose a set of categories and sub-categories drawing inspiration from the Ciqual dataset that we investigated during milestone 2. Its food products are separated into 3 levels of categories, and we derived the following categories from it. As we have products in English as well as in French, we provide the French translation of each subcategory so that we can search in that language as well.
categories = \
{
'meat, fish, egg':
{
'meat': 'viande',
'fish': 'poisson',
'egg': 'oeuf',
},
'fruit, vegetable':
{
'fruit': 'fruit',
'vegetable': 'legume',
'legume': 'Legumineuse',
'seed': 'graine',
},
'cereal based': # Cereal-based products
{
'pasta': 'pate',
'rice': 'riz',
'flour': 'farine',
'bread': 'pain',
'biscuit': 'biscuit',
#'cereals': 'cereales' # Breakfast
},
'beverage': # drink
{
'water': 'eau',
'juice': 'jus',
'soda': 'boisson gazeuse',
'alcohol': 'alcool',
},
'dairy':
{
'milk': 'lait',
'cheese': 'fromage',
'yoghurt': 'yaourt',
'cream': 'creme',
},
'spices, salsa, condiment': # cooking ingredient
{
'salsa': 'sauce',
'spice': 'epice',
'salt': 'sel',
'herb': 'herbe',
'condiment': 'condiment',
},
'oil, butter': #'fat':
{
'oil': 'huile',
'butter': 'beurre',
},
'sugary product':
{
'sugar': 'sucre',
'jam': 'confiture',
'candy': 'bonbon',
'chocolate': 'chocolat',
'ice cream': 'glace',
},
}
We have quite a few categories, but we now need to find keywords associated with each of them. We could create this list manually, but that would be tedious and time-consuming. Instead we will harness web scraping to get such lists automatically.
Wikipedia has pages listing food by category, but the formatting is not homogeneous and implementing a robust scraping method would not save us any time.
Instead, we can take the names of our categories and find their synonyms, using synonymy.com for instance. For each category keyword, we fetch a list of synonyms that we cross-reference with the actual category words in our dataset (obtained previously with CountVectorizer). Finally, we manually filter out the words that are too generic or irrelevant.
We will use BeautifulSoup for web scraping the results.
Note: You can directly jump to this cell and load the generated JSON instead of scraping the data again.
import requests
from bs4 import BeautifulSoup
Let's define a function that fetches the webpage associated with the word whose synonyms we want, and parses the results. We chose this website because it is easy to scrape, and it also has a French version that uses the same HTML layout!
def get_synonyms(word, language='en'):
""" Fetch the list of synonyms by scraping the website http://www.synonymy.com,
or its French version http://www.synonymes.com (which has exactly the same layout).
Args:
word: Word for which to fetch the synonyms
language: 'en' or 'fr'
Return:
A list of string which are the synonyms.
"""
BASE_URL_EN = 'http://www.synonymy.com/synonym.php?word={word}'
BASE_URL_FR = 'http://www.synonymes.com/synonyme.php?mot={word}'
# Build url
if language == 'en':
url = BASE_URL_EN.format(word=word)
elif language == 'fr':
url = BASE_URL_FR.format(word=word)
# Get web page for specified word
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
synonyms = []
# The synonyms on this website are in 'div' with class 'defbox'
for defbox in soup.find_all('div', {'class': 'defbox'}):
synonyms_links = defbox.find_all('a')
synonyms.extend([s.text for s in synonyms_links])
# Remove duplicates
synonyms = list(set(synonyms))
# Remove empty strings ('')
synonyms = [s for s in synonyms if s]
# Remove accents
synonyms = strip_accents(synonyms)
return synonyms
Let's see how it works:
print('english:', get_synonyms('fruit', 'en'))
print('french:', get_synonyms('fruit', 'fr'))
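Since the function above depends on the website being reachable, the extraction step itself can be checked offline. The sketch below mirrors the same logic (collecting link texts inside `div` elements of class `defbox`) with the standard-library HTMLParser, on a hard-coded snippet invented to mimic the site's layout:

```python
from html.parser import HTMLParser

class DefboxLinkParser(HTMLParser):
    """Collect the text of <a> tags located inside <div class="defbox">."""
    def __init__(self):
        super().__init__()
        self.depth_in_defbox = 0  # nesting depth inside a defbox div
        self.in_link = False
        self.synonyms = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            if self.depth_in_defbox or dict(attrs).get('class') == 'defbox':
                self.depth_in_defbox += 1
        elif tag == 'a' and self.depth_in_defbox:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth_in_defbox:
            self.depth_in_defbox -= 1
        elif tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.synonyms.append(data.strip())

html = '''
<div class="defbox"><a href="#">produce</a> <a href="#">crop</a></div>
<div class="other"><a href="#">ignored</a></div>
'''
parser = DefboxLinkParser()
parser.feed(html)
print(parser.synonyms)  # ['produce', 'crop']
```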
Now we have a way to find keywords related to our categories. For each subcategory, we fetch the synonyms in English and in French, and keep only those present in the list of category keywords we extracted from the Open Food Facts dataset:
# Build the list of keywords for the sub categories
category_keywords = copy.deepcopy(categories)
for category in categories:
print(category)
for english, french in categories[category].items():
# Fetch synonyms
synonyms = get_synonyms(english, 'en')
synonyms.extend(get_synonyms(french, 'fr'))
# Remove duplicates
synonyms = list(set(synonyms))
# Keep only categories in Open Food Facts
synonyms = [s for s in synonyms if s in words_occurencies_df.word.values]
# Always include name of sub category
if french != english:
synonyms.insert(0, french)
synonyms.insert(0, english)
category_keywords[category][english] = synonyms
print('\t', english)
print('\t\t', category_keywords[category][english])
As you can see, the keywords are quite good, but a few irrelevant ones remain (for instance, jam is associated with the words fix and box, which are not synonyms in a food context).
The lists are short enough to edit manually: we remove the irrelevant or overly generic keywords, and can add some if any seem to be missing.
# Save as json for manual editing
with open('data/category_keywords.json', 'w') as outfile:
json.dump(category_keywords, outfile, indent=2)
with open('data/category_keywords_edited.json', 'r') as file:
category_keywords = json.load(file)
pprint.pprint(category_keywords)
With this final list of keywords, we can easily fetch the products that belong to a given sub-category.
We can now split the data into categories. Note that the categories may overlap! Indeed, we merely search by keywords and do not impose any constraints on the obtained sets of products. This is fine as long as we keep it in mind.
Let's fill a dictionary containing a dataframe for each subcategory. For the parent categories, we simply concatenate the children subcategory dataframes.
sub_categories_df_dict = copy.deepcopy(categories)
parent_categories_df_dict = {}
for category in sub_categories_df_dict:
print(category)
for subcategory in sub_categories_df_dict[category]:
# Get keywords associated to the subcategory
subcategory_kw = category_keywords[category][subcategory]
# Get dataframe of products in the subcategory
sub_categories_df_dict[category][subcategory] = \
find_products_from_category(food_facts_df, categories_df,
category_keywords=subcategory_kw)
print('\t', subcategory)
print('\t\t number of products:', sub_categories_df_dict[category][subcategory].shape[0])
parent_categories_df_dict[category] = pd.concat([df for df in sub_categories_df_dict[category].values()])
print('\t total products:', parent_categories_df_dict[category].shape[0])
In this section we start generating the interactive plots of the data story using plotly. To embed them in the data story, we created a (free) account and upload the figures to the plotly servers, which makes them easy to reference from the data story.
Uploading requires an API key, which is of course not shared in this repository.
Retrieve the plotly account data to connect to their webservice (through chart_studio).
accounts = pd.read_csv('plotly_keys.csv') # This file is ignored by git for obvious reasons
username = accounts.loc[0].username
api_key = accounts.loc[0].key
import chart_studio
import chart_studio.plotly as py
import chart_studio.tools as tls
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
Then we create a function to generate the URLs that will be used within iframes in the data story to display the interactive plots.
def generate_url(fig, filename='fr_nutri_score', auto_open=True):
'''
Generate the url for the specified plotly figure.
Arguments:
fig: plotly figure
filename: (string) name of the file online
auto_open: (bool) open plot in a new tab for immediate visualization
'''
link = py.plot(fig, filename=filename, auto_open=auto_open)
print(tls.get_embed(link))
urls_df = pd.read_csv('plotly_urls.csv',index_col='name')
urls_df.loc[filename] = link
urls_df.to_csv('plotly_urls.csv')
The two following methods are used to filter the dataframe according to a single category. This filtered dataframe can then be used for all the plots with respect to this given category.
def compute_category_summary(field, cat='all products', subcat=None):
"""
Compute the proportions of a category of product (cat) for each value of a nutri-score (field).
This function adds to the final dataset the type of the product (regular or organic)
Arguments:
field: (string) dataframe column's name to gather data from (fr or uk nutri-scores, nova)
cat: (string) type of product (e.g. fruits, dairy, cereal based, ...).
The categories are defined in the keys of parent_categories_df_dict and sub_categories_df_dict
subcat: (string) product subcategory (subdivision of the previous category)
Output:
dataframe with columns: "score", "proportion", "type", "category"
score: nutri-score values
proportion: proportion of products with given grade (within the score)
type: "bio" or "regular"
category: category of product (meat, dairy, ...)
"""
# Get the correct data collection
if cat == 'all products':
df = food_facts_df
elif subcat is None and cat in parent_categories_df_dict:
df = parent_categories_df_dict[cat]
elif subcat and cat in parent_categories_df_dict and subcat in sub_categories_df_dict[cat]:
df = sub_categories_df_dict[cat][subcat]
else:
print('Unknown category')
return
summary_bio_df = pd.DataFrame([df[df['bio'] == True][field] \
.value_counts(normalize=True).sort_index().rename('proportion')])
summary_bio_df = summary_bio_df.transpose()
summary_bio_df['type'] = 'bio'
summary_regular_df = pd.DataFrame([df[df['bio'] == False][field] \
.value_counts(normalize=True).sort_index().rename('proportion')])
summary_regular_df = summary_regular_df.transpose()
summary_regular_df['type'] = 'regular'
summary_df = pd.concat([summary_bio_df, summary_regular_df])
summary_df.reset_index(level=0, inplace=True)
summary_df.rename(columns={'index': 'score'}, inplace=True)
# Add single category for filtering later in the plots
if cat == 'all products':
summary_df['category'] = 'all products'
elif subcat is None and cat in parent_categories_df_dict:
summary_df['category'] = cat
elif subcat and cat in parent_categories_df_dict and subcat in sub_categories_df_dict[cat]:
summary_df['category'] = subcat
return summary_df
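The core of this computation, value_counts(normalize=True), simply turns counts into proportions. As an aside, the same idea can be sketched without pandas (the helper proportions is ours, for illustration only):

```python
from collections import Counter

def proportions(values):
    """Share of each distinct value, like Series.value_counts(normalize=True),
    with the result keyed by value in sorted order."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in sorted(counts.items())}

grades = ['a', 'b', 'a', 'c', 'a', 'b']
print(proportions(grades))  # a: 0.5, b: ~0.33, c: ~0.17
```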
def compute_all_categories_summary(field):
"""
Compute a final dataframe with all the categories for a given field from the dataset.
It uses the previous function compute_category_summary(...) and calls it for each product category (meat, dairy, ...).
Argument:
field: (string) dataframe column's name to gather data from (fr or uk nutri-scores, nova)
Output:
dataframe with columns: "score", "proportion", "type", "category"
score: nutri-score values
proportion: proportion of products with given grade (within the score)
type: "bio" or "regular"
category: category of product (meat, dairy, ...)
"""
summary_df = compute_category_summary(field, cat='all products', subcat=None)
for category in sub_categories_df_dict:
summary_df = pd.concat([summary_df, compute_category_summary(field, cat=category, subcat=None)])
for subcategory in sub_categories_df_dict[category]:
summary_df = pd.concat([summary_df,
compute_category_summary(field, cat=category, subcat=subcategory)])
summary_df.reset_index(level=0, drop = True, inplace=True)
return summary_df
food_facts_df
Here is an example of the dataframe obtained by running the previous function. In this case, nutrition_grade_fr was evaluated over all categories of products, yielding the following dataframe:
scores_df = compute_all_categories_summary(field='nutrition_grade_fr')
scores_df
Note: This dataframe will be used to generate the bar plots of the datastory!
Utility function definitions:
Now we create different utility functions to build the interactive plots. We need these functions because the plotting code is quite redundant due to the dropdown we wanted to insert, which allows displaying each product category independently.
Each function is described independently.
def compute_sub_element(fig,df,category,ptype):
'''
Function used to create the subtraces representing a single category for a given interactive plot,
being either a barplot, scatterplot or radarplot.
Arguments:
fig: plotly figure
df: dataframe containing the data for this category. This dataframe was filtered a-priori.
category: the current category of product being analyzed.
ptype: the type of the plot: {'bar', 'scatter','radar'}
Returns:
fig: the plotly figure instance with the added traces.
'''
# Only the 'meat, fish, egg' traces are visible by default
boolean = False
if category == 'meat, fish, egg':
boolean = True
if ptype == 'bar':
fig.add_trace(go.Bar(x=df[(df['category'] == category) & \
(df['type'] == 'regular')]['score'], \
y=df[(df['category'] == category) & \
(df['type'] == 'regular')]['proportion'],
name='Regular products',
marker_color='#736372',
visible=boolean))
fig.add_trace(go.Bar(x=df[(df['category'] == category) & \
(df['type'] == 'bio')]['score'], \
y=df[(df['category'] == category) & \
(df['type'] == 'bio')]['proportion'],
name='Organic products',
marker_color='#7DCD85',
visible=boolean))
elif ptype == 'scatter':
fig.add_trace(go.Scatter(x=df[(df['category'] == category) & \
(df['type'] == 'regular')]['score'], \
y=df[(df['category'] == category) & \
(df['type'] == 'regular')]['proportion'],
name='Regular products',
marker_color='#736372',
visible=boolean,
fill='tozeroy'))
fig.add_trace(go.Scatter(x=df[(df['category'] == category) & \
(df['type'] == 'bio')]['score'], \
y=df[(df['category'] == category) & \
(df['type'] == 'bio')]['proportion'],
name='Organic products',
marker_color='#7DCD85',
visible=boolean,
fill='tozeroy'))
elif ptype == 'radar':
sub_df = df[(df['category'] == category) & (df['type'] == 'regular')]
sub_df = sub_df.sort_values(by='additive')
theta = list(sub_df['additive'].values)
theta.append(sub_df['additive'].values[0])
r = list(sub_df['proportion'].values)
r.append(sub_df['proportion'].values[0])
fig.add_trace(go.Scatterpolar(r=r, \
theta=theta,
name='Regular products',
marker_color='#736372',
visible=boolean,
fill='toself'))
sub_df = df[(df['category'] == category) & (df['type'] == 'bio')]
sub_df = sub_df.sort_values(by='additive')
theta = list(sub_df['additive'].values)
theta.append(sub_df['additive'].values[0])
r = list(sub_df['proportion'].values)
r.append(sub_df['proportion'].values[0])
fig.add_trace(go.Scatterpolar(r=r, \
theta=theta,
name='Organic products',
marker_color='#7DCD85',
visible=boolean,
fill='toself'))
return fig
def construct_bool_visibility(bool_vector, cur_cat, cat_list):
"""
Small utility function to create a boolean vector for defining which traces should be shown
when a given category is selected in the dropdown menu.
Arguments:
bool_vector: initial boolean vector being of the size of the number of traces, filled with 0s
cur_cat: the name of the category being analyzed
cat_list: the list containing all the category names
Returns:
bool_vector: the boolean vector having 1s at the trace indices to be plotted for this category.
"""
n = len(bool_vector)
idx = cat_list.index(cur_cat)
bool_vector[(2*idx):(2*idx+2)] = True
return bool_vector
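To see what the function produces, here is a plain-Python mirror of its logic (the original operates on a NumPy array; this sketch, with the made-up name visibility_vector, uses a list):

```python
def visibility_vector(cur_cat, cat_list):
    """Two traces per category (regular + organic); only the selected
    category's pair of traces is made visible."""
    vector = [False] * (2 * len(cat_list))
    idx = cat_list.index(cur_cat)
    vector[2 * idx] = True
    vector[2 * idx + 1] = True
    return vector

cats = ['dairy', 'beverage', 'cereal based']
print(visibility_vector('beverage', cats))  # [False, False, True, True, False, False]
```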
Having defined the utility functions above, we can now generate the plots using plotly. The plot definitions are rather verbose, but in exchange they let us tune the figures to look exactly how we want.
Plot generation works as follows:
we add one pair of traces per category, together with a dropdown button controlling their visibility; we then save the plot in HTML format, upload the figure to the plotly servers, and generate the URL to be embedded in the data story.
updatemenus = [
{
'buttons': list(),
'direction': 'down',
'showactive': True,
'x':0.1,
'xanchor':"left",
'y':1.1,
'yanchor':"top"
}
]
fig = go.Figure()
n_cat = len(list(categories.keys()))
vect_parent = []
for cat in categories.keys():
fig = compute_sub_element(fig, scores_df, cat,'bar')
vect = np.full(n_cat*2, False, dtype=bool)
vect_parent = construct_bool_visibility(vect, cat, list(categories.keys()))
updatemenus[0]['buttons'].append({'method': 'update',
'label': cat,
'args' : [
{'visible': vect_parent}
]})
fig.update_layout(updatemenus=updatemenus,
title=dict(text="Products' French nutritional score distributions",
xanchor= 'center',
yanchor= 'top',
x = 0.5,
y = 0.95),
yaxis=dict(
title='Proportion',
titlefont_size=16,
tickfont_size=14,
gridcolor = 'black',
),
xaxis=dict(
title='French Nutritional score',
titlefont_size=16,
tickfont_size=14,
type="category"
),
plot_bgcolor='rgba(0,0,0,0)'
#annotations = [go.layout.Annotation(text="Category:", showarrow=False,x=-0.6, y=1.08, yref="paper", align="left")]
)
fig.update_yaxes()
fig.show()
fig.write_html('FRnutriscore.html')
generate_url(fig, 'fr_nutri_score', auto_open=False)
Here the protocol is exactly the same as for the French nutrition score; we simply use another input dataframe, filtered for the UK score!
scores_df = compute_all_categories_summary(field='nutrition-score-uk_100g')
scores_df
updatemenus = [
{
'buttons': list(),
'direction': 'down',
'showactive': True,
'x':0.1,
'xanchor':"left",
'y':1.1,
'yanchor':"top"
}
]
fig = go.Figure()
n_cat = len(list(categories.keys()))
vect_parent = []
for cat in categories.keys():
fig = compute_sub_element(fig, scores_df, cat, 'scatter')
vect = np.full(n_cat*2, False, dtype=bool)
vect_parent = construct_bool_visibility(vect, cat, list(categories.keys()))
updatemenus[0]['buttons'].append({'method': 'update',
'label': cat,
'args' : [
{'visible': vect_parent}
]})
fig.update_layout(updatemenus=updatemenus,
title=dict(text="Products' UK nutritional score distributions",
xanchor= 'center',
yanchor= 'top',
x = 0.5,
y = 0.95),
yaxis=dict(
title='Proportion',
titlefont_size=16,
tickfont_size=14,
gridcolor = 'black',
),
xaxis=dict(
title='UK nutritional score',
titlefont_size=16,
tickfont_size=14
),
plot_bgcolor='rgba(0,0,0,0)',
#annotations = [go.layout.Annotation(text="Category:", showarrow=False,x=-11.0, y=1.08, yref="paper", align="left")]
)
fig.update_yaxes()
fig.show()
fig.write_html('UKnutriscore.html')
generate_url(fig, 'uk_nutri_score', auto_open=False)
First observation: the distributions for both scores, nutrition_grade_fr and nutrition-score-uk_100g, are similar across categories. The organic product distributions are always slightly better than those of regular products, but the scores remain comparable.
As for the previous category plots, we just filter the dataframe for the NOVA group.
scores_df = compute_all_categories_summary(field='nova_group')
scores_df
updatemenus = [
{
'buttons': list(),
'direction': 'down',
'showactive': True,
'x':0.1,
'xanchor':"left",
'y':1.1,
'yanchor':"top"
}
]
fig = go.Figure()
n_cat = len(list(categories.keys()))
vect_parent = []
vect_child = []
for cat in categories.keys():
fig = compute_sub_element(fig, scores_df, cat, 'bar')
vect = np.full(n_cat*2, False, dtype=bool)
vect_parent = construct_bool_visibility(vect, cat, list(categories.keys()))
updatemenus[0]['buttons'].append({'method': 'update',
'label': cat,
'args' : [
{'visible': vect_parent}
]})
fig.update_layout(updatemenus=updatemenus,
title=dict(text="Products' Nova group distributions",
xanchor= 'center',
yanchor= 'top',
x = 0.5,
y = 0.95),
yaxis=dict(
title='Proportion',
titlefont_size=16,
tickfont_size=14,
gridcolor = 'black',
),
xaxis=dict(
title='Nova group',
titlefont_size=16,
tickfont_size=14,
type="category"
),
plot_bgcolor='rgba(0,0,0,0)',
#annotations = [go.layout.Annotation(text="Category:", showarrow=False,x=-0.6, y=1.08, yref="paper", align="left")]
)
fig.update_yaxes()
fig.show()
fig.write_html('novascore.html')
generate_url(fig, 'nova_group', auto_open=False)
Note: The NOVA group does not seem to provide much insight. Some types of organic products are also largely processed.
Let us try to analyze the additives for a given category of products.
First, let's define a mapping function to regroup all the additives into 8 additive groups! This particular mapping was found on this Wikipedia page. The principle is that additives have a code of the form Exxx, where the first x defines the family of the additive. The families are listed in the next cell.
# Mapping to simplify the additives representation
#map_func = {1 : 'Colorants', \
# 2 : 'Preservatives', \
# 3 : 'Antioxidants / acidity regulators', \
# 4 : 'Thickeners, stabilisers and emulsifiers', \
# 5 : 'pH regulators and anti-caking agents', \
# 6 : 'Flavour enhancers', \
# 7 : 'Antibiotics', \
# 9 : 'Glazing agents, gases and sweeteners'}
map_func = {1 : 'Colorants', \
2 : 'Preservatives', \
3 : 'Antioxidants', \
4 : 'Thickeners', \
5 : 'Anti-caking agents', \
6 : 'Flavour enhancers', \
7 : 'Antibiotics', \
9 : 'Glazing agents'}
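As a quick illustration of the E-number principle, the first digit after 'E' selects the additive group. This is a minimal sketch (the `group_of` helper is hypothetical; the mapping is the one above):

```python
# Minimal sketch: the first digit after 'E' selects the additive group.
map_func = {1: 'Colorants', 2: 'Preservatives', 3: 'Antioxidants',
            4: 'Thickeners', 5: 'Anti-caking agents', 6: 'Flavour enhancers',
            7: 'Antibiotics', 9: 'Glazing agents'}

def group_of(code):
    """Map an E-number code such as 'E330' to its additive group."""
    return map_func.get(int(code[1]), 'Unknown')

print(group_of('E330'))  # citric acid -> Antioxidants
print(group_of('E621'))  # monosodium glutamate -> Flavour enhancers
```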
Then we define several utility functions to process these additives.
Each function is documented independently.
def extract_additive_type(x):
""" The category of an additive is defined by the first digit after the
letter 'E'. See https://en.wikipedia.org/wiki/E_number for the full list.
Arguments:
x: additive name
Returns:
(int) group of this additive [1-9]
"""
    try:
        # The group is the first digit after 'E' (e.g. 'E330' -> 3)
        return int(str(x)[1]) if len(str(x)) > 1 else -1
    except (ValueError, IndexError):
        return -1
def compute_category_additives_summary(cat='all products', subcat=None):
"""
Compute the proportions of each additive for a given category (cat) of product.
This function adds to the final dataset the type of the product (regular or organic)
Arguments:
cat: (string) type of product (e.g. fruits, dairy, cereal based, ...).
The categories are defined in the keys of parent_categories_df_dict and sub_categories_df_dict
subcat: (string) product subcategory (subdivision of the previous category)
Output:
dataframe with columns: "additive", "proportion", "type", "category"
additive: which additive
proportion: proportion of products with given additive
type: "bio" or "regular"
category: category of product (meat, dairy, ...)
"""
# Get the correct data collection
if cat == 'all products':
df = food_facts_df
elif subcat is None and cat in parent_categories_df_dict:
df = parent_categories_df_dict[cat]
elif subcat and cat in parent_categories_df_dict and subcat in sub_categories_df_dict[cat]:
df = sub_categories_df_dict[cat][subcat]
else:
print('Unknown category')
return
    # Replace NaN entries with empty strings rather than dropping the rows
    df = df[['product_name', 'additives_en', 'bio']].fillna('')
# Explode the additives into separate rows
df['additives_en'] = df['additives_en'].apply(lambda x: list(x.split(",")))
df = df.explode('additives_en')
# Only keep the code of the additive (e.g. E115), discard the name
df['additives_en'] = df['additives_en'].apply(lambda x : x.split(' ')[0])
# Replace additives with their categories
df['additives_en'] = df['additives_en'].apply(extract_additive_type).map(map_func)
# Split bio and regular products
bio_df = df[df['bio'] == True]
regular_df = df[df['bio'] == False]
nb_bio = bio_df.shape[0]
nb_regular = regular_df.shape[0]
# Get the proportion of product containing the additives for bio products
summary_bio_df = bio_df[['product_name', 'additives_en']] \
.groupby('additives_en') \
.count() \
.sort_values('product_name', ascending=False) \
.reset_index() \
.rename(columns={'additives_en' : 'additive', "product_name" : "count"})
summary_bio_df['count'] = summary_bio_df['count'].apply(lambda x : x/nb_bio)
summary_bio_df = summary_bio_df.rename(columns={'count': 'proportion'})
summary_bio_df['type'] = 'bio'
# Get the proportion of product containing the additives for regular products
summary_regular_df = regular_df[['product_name', 'additives_en']] \
.groupby('additives_en') \
.count() \
.sort_values('product_name', ascending=False) \
.reset_index() \
.rename(columns={'additives_en' : 'additive', "product_name" : "count"})
summary_regular_df['count'] = summary_regular_df['count'].apply(lambda x : x/nb_regular)
summary_regular_df = summary_regular_df.rename(columns={'count': 'proportion'})
summary_regular_df['type'] = 'regular'
summary_df = pd.concat([summary_bio_df, summary_regular_df])
    # Reset the index after concatenation (drop=True, so no leftover 'index' column is added)
    summary_df.reset_index(level=0, drop=True, inplace=True)
# Add single category for filtering later in the plots
if cat == 'all products':
summary_df['category'] = 'all products'
elif subcat is None and cat in parent_categories_df_dict:
summary_df['category'] = cat
elif subcat and cat in parent_categories_df_dict and subcat in sub_categories_df_dict[cat]:
summary_df['category'] = subcat
return summary_df
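The split/explode step used inside compute_category_additives_summary can be illustrated on a toy frame (the column names match the real dataset; the values are made up):

```python
import pandas as pd

# Toy frame mimicking the 'additives_en' column (comma-separated codes with names)
toy = pd.DataFrame({
    'product_name': ['cookie', 'yogurt'],
    'additives_en': ['E322 - lecithin,E330 - citric acid', 'E440 - pectin'],
})
# One row per additive
toy['additives_en'] = toy['additives_en'].str.split(',')
toy = toy.explode('additives_en')
# Keep only the code (e.g. 'E322'), discard the name
toy['additives_en'] = toy['additives_en'].str.split(' ').str[0]
print(toy['additives_en'].tolist())  # ['E322', 'E330', 'E440']
```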
def compute_all_categories_additives_summary():
"""
Compute a final dataframe with all the categories displaying their additive content.
It uses the previous function compute_category_additives_summary(...) and calls it for each product category (meat, dairy, ...).
Output:
    dataframe with columns: "additive", "proportion", "type", "category"
        additive: additive group
        proportion: proportion of products containing the additive
        type: "bio" or "regular"
        category: category of product (meat, dairy, ...)
"""
summary_df = compute_category_additives_summary(cat='all products', subcat=None)
for category in sub_categories_df_dict:
summary_df = pd.concat([summary_df, compute_category_additives_summary(cat=category, subcat=None)])
for subcategory in sub_categories_df_dict[category]:
summary_df = pd.concat([summary_df,
compute_category_additives_summary(cat=category, subcat=subcategory)])
summary_df.reset_index(level=0, drop = True, inplace=True)
return summary_df
def fill_missing_additives(df):
"""
    Add a proportion of 0 for the additives that are not present in a given category.
    This addition is required to have homogeneous plots afterwards.
    Output:
        dataframe with columns: "additive", "proportion", "type", "category"
            additive: additive group
            proportion: proportion of products containing the additive
            type: "bio" or "regular"
            category: category of product (meat, dairy, ...)
"""
additives_list = df['additive'].unique()
product_categories = df['category'].unique()
for cat in product_categories:
for product_type in ('bio', 'regular'):
for additive in additives_list:
present_additives = df[(df['category'] == cat)
& (df['type'] == product_type)]['additive'].values
                # Add the additive with value 0 if not present in the category
                # (pd.concat replaces DataFrame.append, removed in pandas 2.0)
                if additive not in present_additives:
                    row = pd.DataFrame([[additive, 0.0, product_type, cat]], columns=df.columns)
                    df = pd.concat([df, row], ignore_index=True)
return df
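As an aside, the nested-loop zero-filling can also be done in one pass with a complete index. This is a sketch under the same column names (`fill_missing_additives_v2` is a hypothetical name, not part of the notebook):

```python
import pandas as pd

def fill_missing_additives_v2(df):
    """Zero-fill missing (category, type, additive) combinations by
    reindexing against the complete Cartesian product of the three keys."""
    full_index = pd.MultiIndex.from_product(
        [df['category'].unique(), ['bio', 'regular'], df['additive'].unique()],
        names=['category', 'type', 'additive'])
    return (df.set_index(['category', 'type', 'additive'])
              .reindex(full_index, fill_value=0.0)
              .reset_index())

# Toy example: only the bio row exists, the regular row is filled with 0
toy = pd.DataFrame({'additive': ['Colorants'], 'proportion': [0.2],
                    'type': ['bio'], 'category': ['dairy']})
filled = fill_missing_additives_v2(toy)
print(len(filled))  # 2 rows: bio (0.2) and regular (0.0)
```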
Then we apply the previously defined processing functions to obtain a dataframe filtered with respect to the additives.
additive_summary_df = compute_all_categories_additives_summary()
additive_summary_df = fill_missing_additives(additive_summary_df)
additive_summary_df
We decided to drop the Antibiotics additive group since its proportion was consistently 0 across all categories.
# Drop the Antibiotics group: these additives are nowhere to be found anyway
additive_summary_df = additive_summary_df[additive_summary_df['additive'] != 'Antibiotics']
Now that the dataframe is ready, we can reuse the previously defined utility functions to create an interactive plot for the additive analysis.
We chose a radar chart for this visualization: it looks neat, and the limited number of axes keeps the plot very readable.
updatemenus = [
{
'buttons': list(),
'direction': 'down',
'showactive': True,
'x':0.05,
'xanchor':"left",
'y':1.15,
'yanchor':"top"
}
]
fig = go.Figure()
n_cat = len(list(categories.keys()))
vect_parent = []
for cat in categories.keys():
fig = compute_sub_element(fig, additive_summary_df, cat, 'radar')
vect = np.full(n_cat*2, False, dtype=bool)
vect_parent = construct_bool_visibility(vect, cat, list(categories.keys()))
updatemenus[0]['buttons'].append({'method': 'update',
'label': cat,
'args' : [
{'visible': vect_parent}
]})
fig.update_layout(updatemenus=updatemenus,
title=dict(text='Proportion of product containing the additive',
xanchor= 'center',
yanchor= 'top',
x = 0.5,
y = 0.95),
polar = dict(radialaxis_angle = 90,
radialaxis = dict(tickangle = 90), #type="log",
bgcolor='rgba(0.5,0.5,0.5,0.1)',
)
)
fig.update_yaxes()
fig.show()
fig.write_html('additiveplot.html')
generate_url(fig, 'additives', auto_open=False)
In this section, we conduct investigations similar to those made for the additives, but this time we look at the nutrients of the products' categories.
Let us first observe the overall completeness of the nutrient data.
# Work on a copy of the original dataset
nutrition_food_fact_df = food_facts_df.copy()
# Only keep float fields, since nutrients are stored as numerical values
# (np.floating replaces the np.float alias removed in NumPy 1.24)
nutrition_food_fact_df = nutrition_food_fact_df.select_dtypes(include=[np.floating])
# Add bio column back
nutrition_food_fact_df = nutrition_food_fact_df.join(food_facts_df['bio'],how='left')
Let us observe the number of NaN per category:
# Display the number of NaNs per field
NaNs_distribution_series = nutrition_food_fact_df.isnull().sum().sort_values()
print(NaNs_distribution_series.head(30))
del NaNs_distribution_series
All fields contain a large proportion of NaN values, which may limit the accuracy of our observations.
An important point is that the proportions of NaN should be similar when comparing organic and regular products. Let us verify the proportion of NaN for both product types (organic and regular):
temp_df = nutrition_food_fact_df[nutrition_food_fact_df['bio']==True]
print('Organic proportion of NaN:',temp_df.isnull().sum().sum()/(temp_df.shape[0]*temp_df.shape[1]))
temp_df = nutrition_food_fact_df[nutrition_food_fact_df['bio']==False]
print('Regular proportion of NaN:',temp_df.isnull().sum().sum()/(temp_df.shape[0]*temp_df.shape[1]))
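The NaN fraction computed above, total NaN count over total cell count, equals the mean of the boolean null mask. A toy check:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})
# sum().sum() / (rows * cols) is the fraction of NaN cells ...
frac = toy.isnull().sum().sum() / (toy.shape[0] * toy.shape[1])
# ... and equals the mean of the boolean NaN mask
print(frac)  # 3 NaN out of 4 cells -> 0.75
```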
Overall, the proportion of NaN is high in both cases when considering the whole dataset. However, if we want to evaluate the composition of the products, we also need to make sure that the proportion of NaN for each individual nutrient is balanced between organic and regular products. Let us observe this for each field:
temp_df = nutrition_food_fact_df[nutrition_food_fact_df['bio']==True]
bio_nan_sr = temp_df.isnull().sum()/(temp_df.shape[0])
temp_df = nutrition_food_fact_df[nutrition_food_fact_df['bio']==False]
regular_nan_sr = temp_df.isnull().sum()/(temp_df.shape[0])
nutrient_wise_nan_proportion_df = pd.concat([bio_nan_sr,regular_nan_sr],axis=1)
nutrient_wise_nan_proportion_df = nutrient_wise_nan_proportion_df.rename(columns={0: "bio", 1: "regular"})
col = nutrient_wise_nan_proportion_df.loc[:,['bio','regular']]
nutrient_wise_nan_proportion_df['average'] = col.mean(axis=1)
Display the proportions of NaN for each nutrient, once for the organic products and once for regular ones:
nutrient_wise_nan_proportion_df.sort_values('average').iloc[1:60]
The proportions of NaN are not always similar for organic and regular products. It is therefore important to carefully select the nutrients used to compare the two classes of products. This will be taken into account later.
A good reference for grouping the nutrients can be found here (resource in French), as these groups are directly used for computing nutritional scores:
Since our goal is to visualize the distribution of nutrients between organic and regular products, we need to select a subset of nutrients so as not to overload the visualization.
The following dictionary maps every nutrient whose name contains one of the dictionary's values to the corresponding key.
mapping_dict = {
'fat': ['saturated-fat','cholesterol'],
'sugar': ['sugar'],
'salt': ['salt'],
'fiber': ['fiber'],
'protein':['protein'],
}
The following function checks whether the input string contains any of the values of the previously defined dictionary and returns the corresponding key if there is a match.
def rename_nutrient(s):
"""
    Apply the mapping to a column name: if it contains one of the strings defined
    in the values of the mapping dictionary, return the corresponding key.
    Otherwise return None, which allows us to discard the nutrients that are not
    covered by the dictionary keys.
    Argument:
        s: (string) name of a nutrient.
    Output:
        (string) corresponding key of the mapping_dict defined above, or None.
"""
for (key,value) in mapping_dict.items():
for v in value:
if s.find(v)>=0:
return key
return None
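As a quick self-contained check (the mapping and function are re-declared locally so this cell runs on its own), rename_nutrient behaves as follows:

```python
# Self-contained copy of the mapping and function defined above
mapping_dict = {
    'fat': ['saturated-fat', 'cholesterol'],
    'sugar': ['sugar'],
    'salt': ['salt'],
    'fiber': ['fiber'],
    'protein': ['protein'],
}

def rename_nutrient(s):
    """Return the group key whose substrings match s, or None."""
    for key, values in mapping_dict.items():
        for v in values:
            if v in s:
                return key
    return None

print(rename_nutrient('saturated-fat_100g'))  # fat
print(rename_nutrient('sugars_100g'))         # sugar
print(rename_nutrient('vitamin-c_100g'))      # None (discarded later)
```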
The following functions generate the data describing the nutrient proportions for each category. Each function is documented individually.
def analyze_nutrients(df,category):
'''
    Return two pandas DataFrames with the mean value per nutrient (in grams per 100 g)
    after removing outliers, computed separately for organic and regular products.
    Arguments:
        df: (dict of dataframes) per-category samples of the Open Food Facts dataframe.
        category: (string) category of product (dairy, cereal based, ...)
    Output:
        organic and regular DataFrames containing the mean quantity of each nutrient in the class.
'''
# Only keep fields of type float since nutrients are displayed with numerical values
    nutrition_food_fact_df = df[category].select_dtypes(include=[np.floating])
# Add bio column back
nutrition_food_fact_df = nutrition_food_fact_df.join(df[category]['bio'],how='left')
# Some columns are still irrelevant to our analysis
col_to_drop = [ 'cities', 'allergens_en', 'serving_quantity', 'no_nutriments', 'additives_n',
'ingredients_from_palm_oil_n', 'ingredients_from_palm_oil',
'ingredients_that_may_be_from_palm_oil_n', 'ingredients_that_may_be_from_palm_oil',
'nova_group', 'energy-kj_100g', 'energy-kcal_100g', 'energy_100g', 'energy-from-fat_100g',
'collagen-meat-protein-ratio_100g','carbon-footprint_100g', 'carbon-footprint-from-meat-or-fish_100g',
'nutrition-score-fr_100g', 'nutrition-score-uk_100g', 'glycemic-index_100g', 'water-hardness_100g']
nutrition_food_fact_df = nutrition_food_fact_df.drop(columns=col_to_drop)
# Separate bio from non-bio
nutrition_facts_bio_df = nutrition_food_fact_df[nutrition_food_fact_df['bio'] == True]
nutrition_facts_non_bio_df = nutrition_food_fact_df[nutrition_food_fact_df['bio'] == False]
# Drop bio column
nutrition_facts_bio_df=nutrition_facts_bio_df.drop(columns='bio')
nutrition_facts_non_bio_df=nutrition_facts_non_bio_df.drop(columns='bio')
    # List the kept columns
columns_keeped=list(nutrition_facts_bio_df.columns)
nutrition_facts_bio_clean_df= dict()
nutrition_facts_non_bio_clean_df= dict()
for column in columns_keeped:
        # Nutrient values must lie between 0 and 100 [g] per 100 g; values outside this range are discarded.
        # For each column we compute the mean over all the remaining rows after filtering.
## bio
temporary_ds=nutrition_facts_bio_df[nutrition_facts_bio_df[column]<=100][column]
nutrition_facts_bio_clean_df[column]=temporary_ds[temporary_ds.values>=0].mean()
## Regular
temporary_ds=nutrition_facts_non_bio_df[nutrition_facts_non_bio_df[column]<=100][column]
nutrition_facts_non_bio_clean_df[column]=temporary_ds[temporary_ds.values>=0].mean()
    # Convert both dictionaries to pandas DataFrames and return them
nutrition_bio=pd.DataFrame.from_dict(nutrition_facts_bio_clean_df,orient='index')
nutrition_non_bio=pd.DataFrame.from_dict(nutrition_facts_non_bio_clean_df,orient='index')
return nutrition_bio,nutrition_non_bio
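As a side note, the per-column [0, 100] g filter used in the loop above can be expressed more compactly with Series.between; a toy sketch:

```python
import pandas as pd

# Grams per 100 g, with two out-of-range outliers
s = pd.Series([5.0, 250.0, -3.0, 40.0])
# between(0, 100) is inclusive on both ends, so 5.0 and 40.0 survive
clean_mean = s[s.between(0, 100)].mean()
print(clean_mean)  # (5 + 40) / 2 = 22.5
```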
def group_nutrients(df):
'''
Handles outputs of analyze_nutrients(). This function is used to map the selected nutrients whose names contain
any of the values of the mapping dictionary defined above.
'''
    nutrients_grouping_df = df.reset_index().rename(columns={'index': 'nutrient', 0: 'proportion'})
    # Unmapped nutrients become NaN (rename_nutrient returns None) and are dropped by the groupby below
    nutrients_grouping_df['nutrient'] = nutrients_grouping_df['nutrient'].apply(rename_nutrient)
    return nutrients_grouping_df.groupby('nutrient').mean()
Now, we apply our pipeline to organic and regular products for each category. We start by calling the analyze_nutrients function, which returns the average of each column over all the products after cleaning, separately for regular and organic products.
Finally, we build the dataframe used for plotting the products' average nutrient quantity per 100 g.
nutriment_radar_data = pd.DataFrame()
for category in list(categories.keys()):
# analyze nutrients for regular and organic products
bio_df,non_bio_df = analyze_nutrients(parent_categories_df_dict,category)
# mapping to the selected nutrients for regular and organic products
bio_data_df = group_nutrients(bio_df)
non_bio_data_df = group_nutrients(non_bio_df)
# Defining the type and category of each product for plotting
bio_data_df['type'] = 'bio'
non_bio_data_df['type'] = 'regular'
bio_data_df['category'] = category
non_bio_data_df['category'] = category
nutriment_radar_data = pd.concat([nutriment_radar_data, pd.concat([bio_data_df,non_bio_data_df],axis=0)])
nutriment_radar_data = nutriment_radar_data.reset_index()
# Rename 'nutrient' to 'additive' so the data matches what the plotting helper expects
nutriment_radar_data['additive'] = nutriment_radar_data['nutrient']
nutriment_radar_data = nutriment_radar_data.drop(columns=['nutrient'])
nutriment_radar_data.head(10)
Well! We have now prepared our dataframe for plotting. Let's draw our interactive radar plot:
updatemenus = [
{
'buttons': list(),
'direction': 'down',
'showactive': True,
'x':0.1,
'xanchor':"left",
'y':1.1,
'yanchor':"top"
}
]
fig = go.Figure()
n_cat = len(list(categories.keys()))
vect_parent = []
for cat in categories.keys():
fig = compute_sub_element(fig, nutriment_radar_data, cat, 'radar')
vect = np.full(n_cat*2, False, dtype=bool)
vect_parent = construct_bool_visibility(vect, cat, list(categories.keys()))
updatemenus[0]['buttons'].append({'method': 'update',
'label': cat,
'args' : [
{'visible': vect_parent}
]})
fig.update_layout(updatemenus=updatemenus,
title=dict(text="Products' average nutrients quantity per 100g [g]",
xanchor= 'center',
yanchor= 'top',
x = 0.5,
y = 0.95),
polar = dict(radialaxis_angle = 90,
radialaxis = dict(tickangle = 90), # type="log",
bgcolor='rgba(0.5,0.5,0.5,0.1)'
)
)
fig.update_yaxes()
fig.show()
fig.write_html('compositionplot.html')
generate_url(fig, 'nutrients', auto_open=False)
This section is a bonus: we generated word clouds per product category to provide an intuitive overview of the product categories we built.
To install the latest version of the library:
git clone https://github.com/amueller/word_cloud.git
cd word_cloud
pip install .
The following generates a word cloud for each category of products and creates an image for each of them.
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image
def create_wordcloud(words, image, save_path=None):
# Wordcloud expects the transparency color to be white with alpha=0
image[(image == (0, 0, 0, 0)).all(axis=2)] = (255, 255, 255, 0)
    # 'stop_words' is assumed to be defined earlier in the notebook (e.g. set(STOPWORDS))
    result_wordcloud = WordCloud(stopwords=stop_words, mask=image, background_color='white', mode="RGBA",
                                 prefer_horizontal=0.8, max_words=1000).generate(words)
# create coloring from image
image_colors = ImageColorGenerator(image)
plt.figure(figsize=[7,7])
plt.imshow(result_wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
plt.axis("off")
if save_path:
result_wordcloud.to_file(save_path)
plt.show()
# Defining the list of image names taken for wordcloud
image_path="images/"
images_names=["fish.png","apple.jpg","pain.png","drink.jpg","cow.jpg","cooking.jpg","huile.png","sweet.jpg"]
# Listing the categories
category_list = list(parent_categories_df_dict.keys())
# Creating a word cloud for each image with the corresponding category, e.g. fish.png --> meat
for index,category in enumerate(category_list):
image = (Image.open(image_path+images_names[index]))
# A simple conversion to RGBA to match different format of the input image
image=np.array(image.convert('RGBA'))
# Getting the corresponding category
df = parent_categories_df_dict[category]
    # Preparing the bag of words for each category to build the word cloud
words = ' '.join(df['product_name'].astype(str).values)
create_wordcloud(words, image=image, save_path=image_path+images_names[index][:-4]+"_wc.png")
We merged all the previous images into one using an external program (Krita) and then plot the final image with Plotly, so that we can zoom into it.
scale_factor = 1.0
local_image_path = 'images/wordcloud_categories.png'
# We need to put this URL in the plot so that it is displayed in the datastory
remote_image_url = 'https://antoineweber.github.io/ADA_Project_RobAda/images/wordcloud_categories.png'
# PIL's Image.size is (width, height); note that np.array(image).shape would be (height, width, channels)
image_width, image_height = Image.open(local_image_path).size
fig = go.Figure()
# add invisible trace
fig.add_trace(
go.Scatter(
x=[0, image_width * scale_factor],
y=[0, image_height * scale_factor],
mode="markers",
marker_opacity=0
)
)
fig.add_layout_image(go.layout.Image(
x=0,
sizex=image_width * scale_factor,
y=image_height * scale_factor,
sizey=image_height * scale_factor,
xref='x',
yref='y',
opacity=1.0,
layer='below',
sizing='stretch',
source=remote_image_url))
fig.update_layout(
title=dict(text='Wordcloud of product categories',
xanchor= 'center',
yanchor= 'top',
x = 0.5,
y = 0.95),
autosize=False,
width=700,
height=700,
plot_bgcolor='rgba(0,0,0,0)')
# Remove axis
fig.update_xaxes(
visible=False,
range=[0, image_width * scale_factor]
)
fig.update_yaxes(
visible=False,
range=[0, image_height * scale_factor]
)
# Disable the autosize on double click because it adds unwanted margins around the image
fig.show(config={'doubleClick': 'reset'})
generate_url(fig, 'category_wordcloud', auto_open=False)